This article provides an overview of the open source and free software tools that are available for patent analytics. The aim of the article is to serve as a quick reference guide for some of the main tools in the tool kit. We will go into some of these tools in more depth elsewhere in the WIPO Open Source Patent Analytics Manual and leave you to explore the rest of the tools for yourself.
Before we start it is important to note that we cover only a fraction of the available tools that are out there. We have simply tried to identify some of the most accessible and useful tools. Data mining and visualisation are growing rapidly to the point that it is easy to be overwhelmed by the range of choices. The good news is that there are some very high quality free and open source tools out there. The difficulty lies in identifying those that will best serve your specific needs relative to your background and the time available to acquire some programming skills. That decision will be up to you. However, to avoid frustration it will be important to recognise that the different tools take time to master. In some cases, such as R and Python, there are lots of free resources out there to help you take the first steps into programming. In making a decision about a tool to use, think carefully about the level of support that is already out there. Try to use a tool with an active and preferably large community of users. That way, when you get stuck, there will be someone out there who has run into similar issues who will be able to help. Sites such as Stack Overflow are excellent for finding solutions to problems.
This article is divided into 8 sections:
In some cases tools are multifunctional and so may appear in one section where they could also appear in another. Rather than repeating information we will let you figure that out.
Quite a number of free tools are available for multi-purpose tasks such as basic cleaning of patent data and visualisation. We highlight three free tools here.
1. Open Office
Many patent analysts use Excel as their default programme, including for basic cleaning of smaller datasets. However, it is well worth considering Apache Open Office as a free alternative. While patent analysis will typically use the Spreadsheet (Open Office Calc), there is also a very useful Database option as an alternative to Microsoft Access.
2. Google Sheets
Google Sheets requires a free Google account, and those who are comfortable with Excel may wonder why it is worth switching. However, Google Sheets can be shared online with others, and a large number of free add-ons, such as Split Names or Remove Duplicates, can be used to assist with cleaning data, as shown below.
3. Google Fusion Tables
Fusion Tables are similar to Google Sheets but can work with millions of records. However, it is worth trying them with smaller datasets first to see if Fusion Tables suit your needs.
Fusion Tables appear very much like a spreadsheet. However, a table also contains a cards feature that allows each record to be seen as a whole and easily filtered. The cards can be much easier to work with than the standard row format, where the information in a record can be difficult to take in. Fusion Tables also attempt to use geocoded data to draw a Google Map, as we can see in the second image below for the publication country from a sample patent dataset.
1. Open Refine (formerly Google Refine)
A fundamental rule of data analysis and visualisation is: rubbish in = rubbish out. If your data has not been cleaned in the first place, do not be surprised if the results of analysis or visualisation are rubbish.
An in depth article is available here on the use of Open Refine (formerly Google Refine) for cleaning patent data. For patent analytics, Open Refine is a very important free tool for cleaning applicant and inventor names.
A number of platforms provide data cleaning facilities and it is possible to do quite a lot of basic cleaning in either Open Office or Excel. Open Refine is the most accessible tool for timely cleaning of patent name fields. In particular, it is very useful for splitting and cleaning thousands of patent applicant and inventor names.
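To make the idea of splitting and cleaning name fields concrete, the following is a minimal Python sketch. It is not Open Refine's own method: it simply splits a multi-name field on a separator and groups spelling variants under a crude normalised key. The function names (`split_names`, `normalise`, `group_variants`), the `;` separator and the applicant values are all assumptions invented for illustration.

```python
import re
from collections import defaultdict


def split_names(field, sep=";"):
    """Split a multi-name field into individual, whitespace-trimmed names."""
    return [name.strip() for name in field.split(sep) if name.strip()]


def normalise(name):
    """Build a crude matching key: upper case, no punctuation, single spaces."""
    key = re.sub(r"[^\w\s]", " ", name.upper())
    return " ".join(key.split())


def group_variants(records):
    """Group raw name variants that share the same normalised key."""
    groups = defaultdict(set)
    for field in records:
        for name in split_names(field):
            groups[normalise(name)].add(name)
    return dict(groups)


# Hypothetical applicant field values, invented for illustration only.
applicants = [
    "Acme Corp.; Example University",
    "ACME CORP ;Example  University",
    "Acme, Corp",
]
variants = group_variants(applicants)
for key, names in sorted(variants.items()):
    print(key, "<-", sorted(names))
```

A key of this kind will merge names that differ only in case, punctuation or spacing. It will not catch misspellings or legal form variants, which is where the interactive review offered by a dedicated cleaning tool becomes valuable.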
There are an ever growing number of data mining tools out there. Here are a few of those that have caught our attention with additional tools listed below.
RStudio is a very powerful tool for working with and visualising data using R and then writing about it (this article, and the wider Manual, is written entirely in RStudio). While the learning curve with R can be intimidating, a great deal of effort goes into making R accessible through tutorials such as those on DataCamp, webinars, R-Bloggers and Stack Overflow, and free university courses such as the well known Johns Hopkins University R Programming Course on Coursera. Indeed, as with Python, there is so much support for users at different levels that it is hard ever to feel alone when using R and RStudio.
To get started with R download RStudio for your platform by following these instructions.
If you are completely new to R then DataCamp is a good place to start. The free Johns Hopkins University R Programming Course on Coursera is also very good. The Johns Hopkins University course is accompanied by the swirl tutorial package, which can be installed using `install.packages("swirl")` once you have installed R. This is a real asset when getting started.
In developing this Manual we mainly focus on R. However, we would emphasise that Python may also be important for your needs. For a recent discussion on the strengths and weaknesses of R and Python see this DataCamp article on the Data Science Wars and the excellent accompanying infographic.
RapidMiner comes with a free service and a variety of tiered paid plans. It focuses on machine learning, data mining, text mining and analytics.
An open platform for data mining.
Other data mining tools (such as WEKA and NLTK in Python) are covered below. If you would like to explore other data mining software try this article for some ideas.
If you are new to data visualisation we suggest that you might be interested in the work of Edward Tufte at Yale University and his famous book The Visual Display of Quantitative Information. His critique of the uses and abuses of PowerPoint is also entertaining and insightful. The work of Stephen Few, such as Show Me the Numbers: Designing Tables and Graphs to Enlighten, is also popular.
Remember that data visualisation is first and foremost about communication with an audience. That involves choices about how to communicate and finding ways to communicate clearly. In very many cases the outcome of patent analysis and visualisation will be a report and a presentation. Tufte’s critique of PowerPoint presentations should be required reading for presenters. You may also like to take a look at Nancy Duarte’s Resonate for ideas on polishing up presentations and storytelling. The style may not suit everyone but Resonate contains very useful messages and insights. In an offline environment, consider Katy Börner’s Atlas of Science: Visualizing What We Know as an excellent guide to the history of visualisations of scientific activity, including pioneering visualisations of patent activity. Bear in mind that effective visualisation takes practice and is a well trodden path.
There are a lot of choices out there for data visualisation tools and the number of tools is growing rapidly. For business analytics Gartner provides a useful (but subscription based) Magic Quadrant for Business Intelligence and Analytics report that seeks to map out the leaders in the field. These types of reports can be useful for spotting up and coming companies and checking if there is a free version of the software (other than a short free trial).
We would suggest thinking carefully about your needs and the learning curve involved. For example, if you have limited programming knowledge (or no time or desire to learn) choose a tool that will largely do the job for you. If you already have experience with JavaScript, Java, R or Python, or similar, then choose the tool that you feel most comfortable with. In particular, keep an eye out for tools with an API (application programming interface) in a variety of language flavours (such as Python or R) that are likely to meet your needs.
If you are completely new to data visualisation Tableau Public and [our walk through article](http://poldham.github.io/tableau-patents/) are a good place to learn without knowing anything about programming. Some other tools in this list are similar to Tableau Public (in part because Tableau is the market leader). We will also provide some pointers to visualisation overview sites at the end of this section where you can find out about what is new and interesting in data visualisation.
The googleVis package and its examples are available here.
An in depth article on getting started with patent analysis and visualisation using Tableau Public is available here. When your patent data has been cleaned, Tableau Public is a powerful way of developing interactive dashboards and maps with your data and combining it with other data sources. Bear in mind that Tableau Public data is, by definition, public and it should not be used with sensitive data.
The workbook can be viewed online here.
3. R and RStudio
R is a statistical programming language for working with all kinds of different types of data. It also has powerful visualisation tools including packages that provide an interface with Google Charts, Plotly and others. If you are interested in using R then we suggest using RStudio which can be downloaded here. The entire WIPO Open Source Patent Analytics Manual was written in RStudio using Rmarkdown to output the articles for the web, .pdf and presentations. As this suggests, it is not simply about data visualisation. To get started with R and RStudio try the free tutorials at DataCamp. We will cover R in more detail in other articles.
As part of an approach described as The Grammar of Graphics, inspired by Leland Wilkinson’s work, developers at RStudio and others have created packages that provide very useful ways to visualise and map data. The links below will take you to the documentation for some of the most popular data visualisation packages.
We will cover ggplot2 and ggvis in greater depth in future articles. Until then, to get started see the articles on ggplot2 on R-Bloggers and here for ggvis. DataCamp offers a free tutorial on the use of ggvis that can be accessed here. For a wider overview of some of the top R packages see Qin Wenfeng’s recent awesome R list.
Shiny from RStudio is a web application framework for R. What that means is that you can output tables and visual data from R such as those from the tools mentioned above to the web.
Shiny apps for R users allow for the creation of online interactive apps (up to 5 for free). See the Gallery for examples and R-Bloggers for more examples and tutorials.
Radiant is a browser based platform for business analytics in R. It is based on Shiny (above) but is specifically business focused.
For a series of starter videos on Radiant see here.
You need to register for a free account to really understand what this is about. Try this page and select register in the top right.
5. Other Visualisation Tools
For additional visualisations see visualizing.org and Open Data Tools.
Network visualisation software is an important tool for visualising actors in a field of science and technology and, in particular, the relationships between them. For patent analysis it can be used for a range of purposes including:
The image below displays a network map of Cooperative Patent Classification (CPC) and International Patent Classification (IPC) codes for tens of thousands of patent documents that contain references to a range of farm animals (cows, pigs, sheep etc.). The dots are CPC/IPC codes describing areas of technology. The clusters show tightly linked documents that share the same codes and can be described as ‘modules’ or clusters. The authors of the landscape report on animal genetic resources used this network as an exploratory tool to extract and examine the documents in each cluster for relevance. Distant clusters, such as Cooking equipment and Animal Husbandry (housing of animals etc.), were discarded. The authors later used network mapping to explore and classify the individual clusters.
As such, network visualisation can be seen as both an exploratory tool for defining the object of interest and as the end result (e.g. a defined network of actors in a specific area).
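A classification code network of the kind described above rests on a simple idea: two codes are linked whenever they appear on the same document, and the link gets stronger each time that happens. The following Python sketch shows one way to count those co-occurrences. It is an illustration under stated assumptions, not the method used in the landscape report: the function name `cooccurrence_edges` and the three sample documents are hypothetical, and the codes are shortened for readability.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(documents):
    """Count how often each pair of classification codes shares a document."""
    edges = Counter()
    for codes in documents:
        # Sort and de-duplicate so (A, B) and (B, A) count as one undirected edge.
        for pair in combinations(sorted(set(codes)), 2):
            edges[pair] += 1
    return edges


# Hypothetical documents, each listed with illustrative classification codes.
docs = [
    ["A01K67", "C12N15", "A61K38"],
    ["A01K67", "C12N15"],
    ["A47J37", "A01K67"],
]
edges = cooccurrence_edges(docs)
for (source, target), weight in edges.most_common():
    print(source, target, weight)
```

The resulting weighted edge list is the raw material that network software lays out and clusters. Weakly connected codes, like the single cooking-related link in this toy data, are the sort of distant material an analyst might then inspect and discard.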
1. Gephi is Java based open source network generating software. It can cope with large datasets (depending on your computer) to produce powerful visualisations.
One issue that may be encountered, particularly by Mac users, is difficulty installing the right Java version, although work is under way to address this problem.
To create .gexf network files in R try the gexf package and example code and source code here. In Python try the pygexf library, and for anything else such as Java, JavaScript, C++ and Perl see gexf.net.
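For readers who want to see what a .gexf file contains before choosing a library, the sketch below writes a small undirected network using only Python's standard `xml.etree.ElementTree` module. It does not use pygexf. It assumes the basic GEXF 1.2 layout of a `graph` element holding `nodes` and `edges`; the function name `build_gexf`, the applicant names and the output file name are hypothetical.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def build_gexf(weighted_edges):
    """Return an ElementTree holding an undirected, static GEXF 1.2 graph."""
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(
        root, "graph", {"mode": "static", "defaultedgetype": "undirected"}
    )
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")

    # Give each distinct label a numeric id in order of first appearance.
    ids = {}
    for source, target, _ in weighted_edges:
        for label in (source, target):
            if label not in ids:
                ids[label] = str(len(ids))
                ET.SubElement(nodes, "node", {"id": ids[label], "label": label})

    for i, (source, target, weight) in enumerate(weighted_edges):
        ET.SubElement(edges, "edge", {
            "id": str(i),
            "source": ids[source],
            "target": ids[target],
            "weight": str(weight),
        })
    return ET.ElementTree(root)


# Hypothetical co-applicant links: (name, name, number of shared documents).
links = [
    ("Acme Corp", "Example University", 3),
    ("Acme Corp", "Sample Institute", 1),
]
tree = build_gexf(links)
xml_text = ET.tostring(tree.getroot(), encoding="unicode")
print(xml_text)
# To save a file for opening in Gephi (file name is an assumption):
# tree.write("network.gexf", encoding="utf-8", xml_declaration=True)
```

For anything beyond a simple weighted network (node attributes, dynamic graphs, visual properties) a dedicated library is likely to be the safer route.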
2. NodeXL
For die hard Excel users, NodeXL is a plug in that can be used to visualise networks. It works well.
3. Cytoscape
Cytoscape is another network visualisation programme. It was originally designed for the visualisation of biological networks and interactions but, as with so many other bioinformatics tools, can be applied to a wider range of visualisation tasks.
We mainly have experience with using Gephi (above) but Cytoscape is well worth exploring and may not suffer from the Java version issues that have affected Mac users (in particular) when working with Gephi. Cytoscape also works with Windows, Mac and Linux.
4. Pajek
This is one of the oldest and most established of the free network tools and is Windows only (or run via a Virtual Machine). It is widely used in bibliometrics and can handle large datasets. It is a matter of personal preference, but tools such as Gephi may be superseding Pajek because they are more flexible. However, Pajek may have an edge in precision, ease of reproducibility and the important ability to save work easily, which Gephi can lack as a Beta programme.
Data can also be exported from Pajek to Gephi for those who prefer the look and feel of Gephi.
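Pajek's plain-text .net format is simple enough to write by hand, which can be handy when preparing data outside Pajek. The Python sketch below produces the basic form of that format: a `*Vertices` block with numbered, quoted labels followed by an `*Edges` block of weighted pairs. It is a minimal illustration that ignores the format's optional extras (such as coordinates or directed arcs); the function name `to_pajek` and the sample names are hypothetical.

```python
def to_pajek(weighted_edges):
    """Format undirected weighted edges as basic Pajek .net text."""
    ids = {}
    for source, target, _ in weighted_edges:
        for label in (source, target):
            # Pajek vertex numbers start at 1; keep first-appearance order.
            ids.setdefault(label, len(ids) + 1)

    lines = [f"*Vertices {len(ids)}"]
    lines += [f'{number} "{label}"' for label, number in ids.items()]
    lines.append("*Edges")
    lines += [f"{ids[s]} {ids[t]} {w}" for s, t, w in weighted_edges]
    return "\n".join(lines) + "\n"


# Hypothetical co-applicant links: (name, name, number of shared documents).
links = [
    ("Acme Corp", "Example University", 3),
    ("Acme Corp", "Sample Institute", 1),
]
net_text = to_pajek(links)
print(net_text)
```

Saving `net_text` to a file with a .net extension should give something that both Pajek and Gephi can be asked to open, although it is worth checking the result against a file exported from Pajek itself.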
5. VOS Viewer
VOS Viewer from Leiden University is similar to Gephi and Cytoscape but also presents different types of landscape (as opposed to pure network node and edge visuals). The latest version can also speak to both Gephi and Cytoscape. It is worth testing for different visual display options and its ability to handle Web of Science and Scopus bibliographic data.